[feat] Track entropy and MI of routing distribution for topk MoE #188
base: main
Conversation
idea is good, thanks @oleksost.
bit weird that all these metrics are appearing as losses. that name should be reserved for things for which gradients are computed. just call this dict metrics?
Yes @tscholak, addressed. Using a metrics dict instead.
Looks good, got some comments on the structure.
@@ -135,3 +135,7 @@ def get_tied_weights(self) -> dict[str, tuple[ParameterMeta, tuple[int, ...]]]:
    @abc.abstractmethod
    def loss_defs(self) -> list[LossDef]:
        pass

    @property
This loss/metric split is way more complicated than needed. How about having a single entry, and using an is_metric flag in LossDef (or a derived class) to distinguish? Then no change is needed other than extracting metrics from the context before returning from run_step.
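A minimal sketch of that suggestion, assuming LossDef is a plain dataclass with name/formatted_name/count fields (the actual class in Fast-LLM may differ):

```python
# Hypothetical sketch: a single definition list with an is_metric flag,
# rather than parallel loss/metric structures. Field names are assumptions.
import dataclasses


@dataclasses.dataclass
class LossDef:
    name: str
    formatted_name: str
    count: int = 1
    # False for true losses (gradients are computed), True for tracked-only
    # quantities such as routing entropy and mutual information.
    is_metric: bool = False


def extract_metrics(reduced: dict[str, float], defs: list[LossDef]) -> dict[str, float]:
    """Pull the metric entries out of a reduced loss dict, e.g. before returning from run_step."""
    metric_names = {d.name for d in defs if d.is_metric}
    return {name: value for name, value in reduced.items() if name in metric_names}
```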
@@ -289,6 +312,19 @@ def _reduce_losses(self, context: BatchContext) -> dict[str, float | int]:
            for name, reduced_loss in reduced_losses.items()
        }

    def _is_reduced_metric(self, metric_name: str) -> bool:
        """Check if a metric should be reduced (is defined in a TransformerReducedMetrics subclass)."""
        from fast_llm.layers.transformer.config import TransformerReducedMetrics
We can't use hard-coded values here. Suggestion above would fix it, or there are a few other ways to get this dynamically.
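For instance, with the is_metric flag sketched above, the check could become fully dynamic instead of importing a hard-coded class (the loss_defs access path here is an assumption):

```python
# Hypothetical replacement for _is_reduced_metric: consult the model's
# declared loss/metric definitions rather than a hard-coded import.
def _is_reduced_metric(self, metric_name: str) -> bool:
    """True if metric_name was declared with is_metric=True in the model's loss_defs."""
    return any(d.is_metric and d.name == metric_name for d in self._multi_stage.base_model.loss_defs)
```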
# Store these metrics
if metrics is not None:
Given the extra computation involved, this should be enabled through a config parameter
how much compute are we talking about for these metrics? likely this won't be noticeable.
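If the config route is taken, a hypothetical shape (Fast-LLM's real configs use their own Field/config_class machinery, so this is only illustrative):

```python
# Illustrative opt-in flag; the name and placement are assumptions.
import dataclasses


@dataclasses.dataclass
class RouterMetricsConfig:
    # Off by default so the extra entropy/MI computation costs nothing
    # unless explicitly requested.
    calculate_routing_metrics: bool = False
```

The router forward would then guard the computation with something like `if self._config.calculate_routing_metrics and metrics is not None:`.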
assert 0.0 < mutual_info < 1.0, f"Expected value between 0 and 1, got {mutual_info}"


def test_edge_cases():
More explicit name?
@pytest.fixture
def setup_runner():
These don't belong here. How about test_runner.py?
why don't they belong here? this fixture is only useful for the tests in this suite.
if __name__ == "__main__":
Not needed
@@ -26,6 +27,35 @@
logger = logging.getLogger(__name__)


def calculate_normalized_average_entropy(probs: torch.Tensor) -> torch.Tensor:
Could try @torch.compile on these for a free performance boost.
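For example (illustrative body; the decorator is the point, and torch.compile requires PyTorch 2.x):

```python
import math

import torch


@torch.compile  # lets PyTorch fuse the elementwise log/mul/sum into fewer kernels
def calculate_normalized_average_entropy(probs: torch.Tensor) -> torch.Tensor:
    # Per-token entropy over the expert dimension, clamped to avoid log(0),
    # then averaged and normalized by log(n_experts) to land in [0, 1].
    entropy = -(probs.clamp_min(1e-9) * probs.clamp_min(1e-9).log()).sum(dim=-1)
    return entropy.mean() / math.log(probs.size(-1))
```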
    average_entropy = entropy_values.mean()  # Average over batch and tokens
    return average_entropy / torch.log(torch.tensor(n_experts, dtype=probs.dtype, device=probs.device))


def entropy(probs: torch.Tensor) -> torch.Tensor:
calculate_entropy?
✨ Description
To better detect potential routing collapse and to better understand the routing distribution, we can track the average entropy and the mutual information of the routing probabilities.
Collapsed routing would have low entropy and low mutual information. A healthy, specialised router would have low entropy and high mutual information, meaning that routing is confident and considerably different across tokens.
More specifically, mutual information measures the difference between the entropy of the batch-averaged routing distribution and the average entropy of the per-token routing distributions: it is high when each token routes confidently but different tokens pick different experts.
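A sketch of both quantities for a routing-probability tensor (the function names mirror the helpers added in this PR, but the shapes and normalization shown here are assumptions):

```python
import torch


def calculate_entropy(probs: torch.Tensor) -> torch.Tensor:
    # Entropy over the expert dimension; clamping avoids log(0) for
    # topk-masked probabilities.
    probs = probs.clamp_min(1e-9)
    return -(probs * probs.log()).sum(dim=-1)


def calculate_normalized_average_entropy(probs: torch.Tensor) -> torch.Tensor:
    # Mean per-token entropy, normalized by log(n_experts) to land in [0, 1].
    # probs: (batch, seq_len, n_experts), rows summing to 1.
    n_experts = probs.size(-1)
    return calculate_entropy(probs).mean() / torch.log(
        torch.tensor(n_experts, dtype=probs.dtype, device=probs.device)
    )


def calculate_mutual_information(probs: torch.Tensor) -> torch.Tensor:
    # MI = H(batch-averaged routing distribution) - mean per-token entropy.
    # Low everywhere => collapse; low per-token entropy with high MI =>
    # confident yet diverse (specialised) routing.
    n_experts = probs.size(-1)
    marginal = probs.reshape(-1, n_experts).mean(dim=0)
    mi = calculate_entropy(marginal) - calculate_entropy(probs).mean()
    return mi / torch.log(torch.tensor(n_experts, dtype=probs.dtype, device=probs.device))
```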
🔍 Type of change
Select all that apply:
📝 Changes
Entropy and mutual-information metrics are computed in mixture_of_experts.py; they are calculated only for the topk routing type.
✅ Checklist
General
Testing
Performance Impact
📊 Performance Impact Details
I am not 100% sure there is no performance impact; we are calculating the stats at each forward pass through the router.
🗒️ Additional Notes